Citi Bike Usage in NYC

STA 9750 Individual Final Report

Author

Sabrina Zhu

Published

December 13, 2025

Introduction

Specific Question: How does station infrastructure relate to e-bike vs classic bike usage?

This analysis explores whether the quality of bike-share infrastructure, measured by station density and proximity to bike lanes, affects which type of bike riders use. The findings contribute to the overarching group question: Within Manhattan, does bike-share usage respond more to infrastructure or to external factors?

Show the code
library(data.table)
library(sf)
library(ggplot2)
library(dplyr)
library(viridis)

Data Loading

Show the code
# Load the 5% stratified sample
citibike_sample <- readRDS("data/processed/citibike_manhattan_sample_5pct.rds")
routes <- readRDS("data/individual_report/bike_routes.rds")
manhattan_poly <- readRDS("data/gis/manhattan_polygon.rds")

cat("Loaded", format(nrow(citibike_sample), big.mark = ","), "trips\n")
Loaded 1,618,078 trips
Show the code
cat("Date range:", as.character(min(citibike_sample$date)), 
    "to", as.character(max(citibike_sample$date)), "\n")
Date range: 2024-09-29 to 2025-10-31 

Methodology: Infrastructure Classification

I classified each Citi Bike station by infrastructure quality using two metrics:

  1. Station Density: Number of other stations within 500 meters
  2. Bike Lane Proximity: Distance to the nearest bike lane

These were combined into a composite infrastructure score and categorized into Low, Medium, and High levels.
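
To make the scoring concrete, here is a minimal sketch on five made-up stations. The counts and distances are illustrative only and are not taken from the data; the data frame name `toy` is hypothetical. The sketch assumes the same recipe as the report's code: min-max scaling for density, a capped and inverted distance for lane proximity, and an equal-weight average of the two.

```r
# Toy inputs: hypothetical values, not real station data
toy <- data.frame(
  nearby_count        = c(2, 5, 8, 11, 14),
  dist_to_bike_lane_m = c(400, 120, 60, 10, 0)
)

# Min-max scale density so the sparsest station scores 0 and the densest 1
rng <- range(toy$nearby_count)
toy$density_score <- (toy$nearby_count - rng[1]) / (rng[2] - rng[1])

# Cap distances at the 95th percentile, then invert so closer = higher score
max_dist <- quantile(toy$dist_to_bike_lane_m, 0.95, names = FALSE)
toy$lane_proximity_score <- 1 - pmin(toy$dist_to_bike_lane_m, max_dist) / max_dist

# Equal-weight composite, bounded between 0 and 1
toy$infrastructure_score <- 0.5 * toy$density_score + 0.5 * toy$lane_proximity_score

print(toy)
```

Because both components lie between 0 and 1, the composite does too. The 95th-percentile cap keeps a single remote station from compressing every other station's proximity score toward 1.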

Show the code
# Get unique stations with coordinates
stations <- citibike_sample[!is.na(start_station_id) & 
                             !is.na(start_lat) & 
                             !is.na(start_lng), 
                           .(lat = first(start_lat),
                             lng = first(start_lng),
                             station_name = first(start_station_name)),
                           by = start_station_id]

cat("Found", nrow(stations), "unique stations\n")
Found 734 unique stations
Show the code
# Convert to spatial object
stations_sf <- st_as_sf(stations, coords = c("lng", "lat"), crs = 4326, remove = FALSE)

# Metric 1: Station Density (count nearby stations within 500m)
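# With lon/lat data, dist = 500 is read as metres when sf's s2 engine is on (the default in sf >= 1.0)
stations_buffer <- st_buffer(stations_sf, dist = 500)
# Count stations inside each buffer, minus 1 to exclude the station itself
stations$nearby_count <- lengths(st_intersects(stations_buffer, stations_sf)) - 1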

# Metric 2: Distance to Bike Lanes
manhattan_bbox <- st_bbox(c(xmin = -74.02, xmax = -73.90, ymin = 40.70, ymax = 40.88), 
                          crs = st_crs(4326)) |> st_as_sfc()

# Reproject first so the routes and the bounding box share a CRS before intersecting
manhattan_routes_sf <- routes %>%
  st_transform(4326) %>%
  filter(st_intersects(geometry, manhattan_bbox, sparse = FALSE)[,1])

# Distance in metres from each station to its nearest bike route
nearest_route <- st_nearest_feature(stations_sf, manhattan_routes_sf)
stations$dist_to_bike_lane_m <- as.numeric(
  st_distance(stations_sf, manhattan_routes_sf[nearest_route, ], by_element = TRUE)
)

# Create Infrastructure Score (normalized 0-1)
stations$density_score <- (stations$nearby_count - min(stations$nearby_count)) / 
  (max(stations$nearby_count) - min(stations$nearby_count))

max_dist <- quantile(stations$dist_to_bike_lane_m, 0.95)
stations$lane_proximity_score <- 1 - pmin(stations$dist_to_bike_lane_m, max_dist) / max_dist

stations$infrastructure_score <- 0.5 * stations$density_score + 0.5 * stations$lane_proximity_score

# Classify into tertiles
tertiles <- quantile(stations$infrastructure_score, probs = c(0.33, 0.67))
stations$infrastructure_level <- cut(
  stations$infrastructure_score,
  breaks = c(-Inf, tertiles[1], tertiles[2], Inf),
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)

cat("\nInfrastructure Classification:\n")

Infrastructure Classification:
Show the code
print(table(stations$infrastructure_level))

   Low Medium   High 
   242    250    242 
Show the code
# Save station classification
saveRDS(stations, "data/processed/stations_infrastructure.rds")

Visualization 1: Infrastructure Map

This map shows how stations are classified across Manhattan. High infrastructure (green) is concentrated in Lower Manhattan, while low infrastructure (red) is more common in Upper Manhattan.

Show the code
map1 <- ggplot() +
  # linewidth replaces size for line widths in ggplot2 >= 3.4.0
  geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
  geom_sf(data = manhattan_routes_sf, color = "lightblue", alpha = 0.3, linewidth = 0.5) +
  geom_point(data = stations, 
             aes(x = lng, y = lat, color = infrastructure_level, size = nearby_count),
             alpha = 0.7) +
  scale_color_manual(
    values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
    name = "Infrastructure Level"
  ) +
  scale_size_continuous(name = "Station Density", range = c(1, 4)) +
  labs(
    title = "Citi Bike Station Infrastructure",
    subtitle = "Stations classified by bike lane proximity and station density",
    caption = "Size indicates number of nearby stations within 500m"
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40")
  )

print(map1)

Station-Level E-bike Analysis

Show the code
# Calculate e-bike % by station
station_variation <- citibike_sample[, .(
  ebike_pct = mean(rideable_type == "electric_bike") * 100,
  trips = .N,
  lat = mean(start_lat),
  lng = mean(start_lng)
), by = start_station_name]

# Merge with infrastructure
station_full <- merge(station_variation, 
                      stations[, .(station_name, infrastructure_level, infrastructure_score)],
                      by.x = "start_station_name", 
                      by.y = "station_name",
                      all.x = TRUE)

# Calculate bike balance (relative to average)
overall_ebike_share <- mean(citibike_sample$rideable_type == "electric_bike") * 100
station_full[, bike_balance := ebike_pct - overall_ebike_share]

cat("Overall e-bike usage:", round(overall_ebike_share, 1), "%\n")
Overall e-bike usage: 68 %

Visualization 2: Infrastructure vs E-bike Usage (Scatter Plot)

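This scatter plot shows the relationship between infrastructure score and e-bike usage at the station level. Each dot is a station, colored by infrastructure level and sized by trip count. The plot includes only stations with at least 50 trips and at least 40% e-bike usage, while the correlation reported after the plot is computed over all stations.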

Show the code
scatter <- ggplot(station_full[!is.na(infrastructure_score) & trips >= 50 & ebike_pct >= 40], 
       aes(x = infrastructure_score, y = ebike_pct)) +
  geom_point(aes(color = infrastructure_level, size = trips), alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", linetype = "dashed", se = TRUE) +
  scale_color_manual(values = c("Low" = "#F44336", "Medium" = "#FF9800", "High" = "#4CAF50"),
                     name = "Infrastructure Level") +
  scale_size_continuous(range = c(1, 6), name = "Total Trips") +
  labs(
    title = "Infrastructure Score vs. E-bike Usage",
    subtitle = "Each dot is a station | Dashed line shows overall trend",
    x = "Infrastructure Score (higher = better infrastructure)",
    y = "E-bike Usage (%)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"))

print(scatter)

Show the code
# Calculate correlation
cor_value <- cor(station_full$infrastructure_score, station_full$ebike_pct, use = "complete.obs")
cat("\nCorrelation between infrastructure and e-bike usage:", round(cor_value, 3), "\n")

Correlation between infrastructure and e-bike usage: -0.165 

Key Finding: There is a weak negative correlation (r = -0.165) between infrastructure score and e-bike usage. Stations with higher infrastructure scores tend to show slightly lower e-bike usage rates.
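
A single correlation coefficient says little about uncertainty or about how much busy stations drive the result. The sketch below shows one way to check both, using simulated data as a stand-in for `station_full`. The simulated values, the seed, and the name `toy` are assumptions for illustration, so the printed numbers do not describe the real stations.

```r
set.seed(1)

# Hypothetical stand-in for station_full: simulated, not the real stations
n <- 200
toy <- data.frame(
  infrastructure_score = runif(n),
  trips = sample(50:5000, n, replace = TRUE)
)
toy$ebike_pct <- 72 - 6 * toy$infrastructure_score + rnorm(n, sd = 8)

# Pearson correlation with a confidence interval and p-value
ct <- cor.test(toy$infrastructure_score, toy$ebike_pct)
print(ct$estimate)
print(ct$conf.int)

# Trip-weighted correlation, so busy stations count for more
wc <- cov.wt(toy[, c("infrastructure_score", "ebike_pct")],
             wt = toy$trips / sum(toy$trips), cor = TRUE)$cor[1, 2]
print(wc)
```

If the weighted and unweighted correlations diverged on the real data, that would suggest the relationship differs between high-volume and low-volume stations.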

Visualization 3: Bike Type Balance Map

This map shows where e-bikes vs classic bikes dominate across Manhattan, relative to the average (68% e-bike).

Show the code
balance_map <- ggplot() +
  geom_sf(data = manhattan_poly, fill = "gray95", color = "gray60", linewidth = 0.3) +
  geom_point(data = station_full[trips >= 50],
             aes(x = lng, y = lat, color = bike_balance, size = trips),
             alpha = 0.9) +
  scale_color_gradientn(
    colors = c("#08306b", "#2171b5", "#6baed6", "#f7f7f7", "#fcbba1", "#fb6a4a", "#cb181d"),
    values = scales::rescale(c(-60, -30, -10, 0, 10, 25, 40)),
    limits = c(-60, 40),
    breaks = c(-30, 0, 20),
    labels = c("More classic", "Near avg", "More e-bikes"),
    name = "Relative E-bike Share"
  ) +
  scale_size_continuous(range = c(2, 7), name = "Station\nTrip Volume", labels = scales::comma) +
  labs(
    title = "Where Are Stations More or Less E-bike-Heavy?",
    subtitle = sprintf("Color shows difference from Manhattan average (~%.0f%% e-bikes): orange = higher, blue = lower", overall_ebike_share),
    caption = "Data: Citi Bike Manhattan trips | Stations with 50+ trips shown"
  ) +
  coord_sf(datum = NA) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank(),
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray40")
  )

print(balance_map)

Key Finding: Lower Manhattan (high infrastructure) shows more classic bike usage (blue), while Upper Manhattan (low infrastructure) shows more e-bike usage (orange).

Visualization 4: E-bike Patterns by Rider Type, Volume, and Time

Show the code
# Define time periods
citibike_sample[, time_period := fcase(
  hour %in% c(6,7,8,9), "Morning Rush",
  hour %in% c(10,11,12,13,14,15), "Midday",
  hour %in% c(16,17,18,19), "Evening Rush",
  hour %in% c(20,21,22,23,0,1,2,3,4,5), "Night"
)]
citibike_sample[, time_period := factor(time_period, 
  levels = c("Morning Rush", "Midday", "Evening Rush", "Night"))]

# Define volume groups
citibike_sample[, station_volume := .N, by = start_station_name]
citibike_sample[, volume_group := cut(station_volume, 
  breaks = quantile(station_volume, probs = c(0, 0.33, 0.67, 1)),
  labels = c("Low Traffic", "Medium Traffic", "High Traffic"),
  include.lowest = TRUE)]

# Calculate e-bike % by rider type, volume, and time
ebike_full <- citibike_sample[!is.na(volume_group) & !is.na(time_period), .(
  ebike_pct = mean(rideable_type == "electric_bike") * 100,
  trips = .N
), by = .(member_casual, volume_group, time_period)]

ebike_full[, rider_label := ifelse(member_casual == "casual", "Casual Riders", "Members")]
Show the code
heatmap <- ggplot(ebike_full, aes(x = time_period, y = volume_group, fill = ebike_pct)) +
  geom_tile(color = "white", linewidth = 1.5) +
  geom_text(aes(label = paste0(round(ebike_pct, 1), "%")),
            size = 4.5, fontface = "bold", color = "white") +
  facet_wrap(~rider_label) +
  scale_fill_gradient2(
    low = "#3498db",
    mid = "#95a5a6",
    high = "#e74c3c",
    midpoint = 68,
    name = "E-bike\nUsage",
    limits = c(60, 80),
    breaks = seq(60, 80, 5),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    title = "E-bike Usage Patterns Across Multiple Dimensions",
    subtitle = "Blue = More classic bikes | Red = More e-bikes | Overall average = 68% e-bikes",
    x = "Time of Day",
    y = "Station Activity Level",
    caption = "Rider Type Distribution: Casual – 17% | Member – 83%"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 11),
    axis.text.y = element_text(size = 11),
    strip.text = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0),
    plot.subtitle = element_text(size = 11, color = "gray30"),
    plot.caption = element_text(hjust = 0, size = 9),
    panel.grid = element_blank(),
    legend.position = "right"
  )

print(heatmap)

Key Findings:

  • Highest e-bike usage (78.8%): Casual riders at low-traffic stations during evening rush
  • Lowest e-bike usage (63.5%): Members at high-traffic stations during morning rush
  • E-bike usage is higher at low-traffic stations and during the evening rush. One possible reading is that riders prefer an easier ride home after a long day and that e-bikes are more likely to be available at quieter stations.

Summary Statistics

Show the code
# Rider type summary
rider_summary <- citibike_sample[, .(
  total_trips = .N,
  electric_bikes = sum(rideable_type == "electric_bike"),
  classic_bikes = sum(rideable_type == "classic_bike")
), by = member_casual]

rider_summary[, `:=`(
  pct_electric = round(electric_bikes / total_trips * 100, 1),
  pct_classic = round(classic_bikes / total_trips * 100, 1)
)]

cat("=== Rider Type Summary ===\n")
=== Rider Type Summary ===
Show the code
print(rider_summary)
   member_casual total_trips electric_bikes classic_bikes pct_electric
          <char>       <int>          <int>         <int>        <num>
1:        member     1340439         896332        444107         66.9
2:        casual      277639         203233         74406         73.2
   pct_classic
         <num>
1:        33.1
2:        26.8
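
The counts in this table are enough to recompute the percentages by hand and to get each rider type's share of all sampled trips. The sketch below copies the four counts from the printed output; the vector names are hypothetical. It also runs a two-sample test of equal proportions as one possible way to compare the two e-bike shares.

```r
# Counts copied from the rider summary table above
total_trips    <- c(member = 1340439, casual = 277639)
electric_bikes <- c(member = 896332,  casual = 203233)

# Share of all sampled trips made by each rider type
rider_share <- round(total_trips / sum(total_trips) * 100, 1)
print(rider_share)

# E-bike share within each rider type (should match pct_electric above)
pct_electric <- round(electric_bikes / total_trips * 100, 1)
print(pct_electric)

# Two-sample test of equal proportions. With samples this large, almost any
# gap is statistically significant, so the size of the gap matters more.
pt <- prop.test(electric_bikes, total_trips)
print(pt$estimate)
```

Members account for about 83% of sampled trips and casual riders for about 17%, so the member e-bike share dominates the overall 68% figure.
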
Show the code
# E-bike by infrastructure
infra_summary <- station_full[!is.na(infrastructure_level), .(
  mean_ebike = round(mean(ebike_pct), 1),
  n_stations = .N
), by = infrastructure_level]

cat("\n=== E-bike Usage by Infrastructure Level ===\n")

=== E-bike Usage by Infrastructure Level ===
Show the code
print(infra_summary)
   infrastructure_level mean_ebike n_stations
                 <fctr>      <num>      <int>
1:                  Low       71.8        242
2:               Medium       71.8        250
3:                 High       66.7        242
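
The tier means above weight every station equally, while the overall 68% figure weights every trip equally. That is one likely reason the tier means (71.8, 71.8, 66.7) sit mostly above 68%. The sketch below illustrates the difference with three made-up stations; the numbers and names are hypothetical.

```r
# Hypothetical stations: one busy, two quiet (made-up numbers)
toy <- data.frame(
  station   = c("A", "B", "C"),
  trips     = c(9000, 500, 500),
  ebike_pct = c(60, 80, 80)
)

# Each station counts once, regardless of how many trips it has
unweighted <- mean(toy$ebike_pct)

# Each trip counts once, so the busy station dominates
weighted <- weighted.mean(toy$ebike_pct, toy$trips)

cat("Unweighted station mean:", round(unweighted, 1), "\n")
cat("Trip-weighted mean:     ", round(weighted, 1), "\n")
```

In this toy case the unweighted mean is about 73.3% and the trip-weighted mean is 62%. Which one is appropriate depends on whether the unit of interest is the station or the trip.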

Conclusions

Key Findings

  1. Infrastructure and e-bike usage have a weak inverse relationship: High-infrastructure stations average ~67% e-bike usage, versus ~72% at both low- and medium-infrastructure stations. The gap appears only in the top tier.

  2. Geographic pattern: Lower Manhattan (high infrastructure) shows more classic bike usage, while Upper Manhattan (low infrastructure) shows more e-bike usage.

  3. A possible explanation: High-infrastructure areas attract more riders, creating higher demand. E-bikes are more popular, so they get taken first. By the time many riders arrive, only classic bikes remain. This analysis measures usage rather than bike availability, so this remains a hypothesis.

  4. Pattern holds across rider types: Both casual riders and members show the same trends, with the evening rush and low-traffic stations having the highest e-bike usage.
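
One way to probe the demand explanation in finding 3 would be to check whether the infrastructure effect shrinks once station volume is held constant. The sketch below uses simulated data in which infrastructure raises trip volume and only volume lowers e-bike share. Everything in it is an assumption for illustration (the simulated values, the seed, and the names `toy`, `fit_simple`, and `fit_adjusted`), so it shows the technique rather than a result about Citi Bike.

```r
set.seed(42)
n <- 300

# Simulated stand-in for station_full (hypothetical values only):
# infrastructure drives volume, and volume alone drives e-bike share
toy <- data.frame(infrastructure_score = runif(n))
toy$trips <- round(exp(5 + 2 * toy$infrastructure_score + rnorm(n, sd = 0.5)))
toy$ebike_pct <- 90 - 3 * log(toy$trips) + rnorm(n, sd = 4)

# Infrastructure alone: picks up the indirect effect through volume
fit_simple <- lm(ebike_pct ~ infrastructure_score, data = toy)

# Infrastructure plus volume: what is left after holding volume constant
fit_adjusted <- lm(ebike_pct ~ infrastructure_score + log(trips), data = toy)

print(coef(summary(fit_simple)))
print(coef(summary(fit_adjusted)))
```

If the real data behaved like this simulation, the infrastructure coefficient would be clearly negative in the first model and close to zero in the second. A coefficient that stayed negative after adjustment would point to something beyond demand.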

Connection to Overarching Question

“Does bike-share usage respond more to infrastructure or external factors?”

For bike type choice, infrastructure appears to affect availability indirectly through demand. High-infrastructure areas likely have higher demand, which may deplete e-bikes and leave riders with classic bikes. If so, this is an infrastructure effect, but one driven by usage patterns rather than by bike lanes themselves.

Conclusion: Infrastructure does not appear to directly determine bike type preference, but it may shape availability by concentrating demand. Riders in high-infrastructure areas may want e-bikes but end up on classic bikes because of supply constraints.